Lecture 3.4 - Hypothesis Test Wisdom

Author

Student

Published

April 22, 2024

Hypothesis Testing Wisdom

Setting Expectations

Today we are working with a dataset on wages in the U.S. collected by the U.S. Department of Labor: us.dol.wages.csv

We will first assume that the dataset approximately represents all workers in the U.S., so its means and proportions can serve as stand-ins for the true population parameters.

  1. Subset your data to the South region only and take a sample of 100 workers using the slice_sample() command:
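library(dplyr)  # provides %>%, filter(), and slice_sample()
us.dol.wages <- read.csv("us.dol.wages.csv")  # assumes the file is in your working directory
south_sample <- us.dol.wages %>%
    filter(south == "yes") %>%
    slice_sample(n = 100, replace = TRUE)  # draws 100 rows with replacement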
  2. Write down your expectations for the following variables, including whether and how the South sample might differ from the population. Use Google if needed.
ed - education (years)
wage - salary per week
bluecol - whether the worker works in a blue collar job
union - whether the person is in a union

You can find the proportions/means of these variables with the table() and summary() commands.
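
As a rough illustration of those two commands, the sketch below summarises a stand-in sample. The data frame is simulated so the block runs on its own, and the "yes"/"no" coding of bluecol and union is an assumption based on how south is coded. With the course data loaded, skip the simulation and use your own south_sample.

```r
# Simulated placeholder data, NOT the real wages file.
set.seed(1)
south_sample <- data.frame(
  ed      = sample(8:17, 100, replace = TRUE),
  wage    = round(runif(100, min = 300, max = 2000)),
  bluecol = sample(c("yes", "no"), 100, replace = TRUE),
  union   = sample(c("yes", "no"), 100, replace = TRUE)
)

# Numeric variables: summary() reports the mean alongside the quartiles.
summary(south_sample$ed)
summary(south_sample$wage)

# Categorical variables: table() gives counts,
# and prop.table() converts those counts to proportions.
table(south_sample$bluecol)
union_props <- prop.table(table(south_sample$union))
union_props
```

Wrapping table() in prop.table() is one convenient way to get proportions directly; dividing the counts by nrow(south_sample) gives the same numbers.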

Confidence intervals, \(p\)-values, and \(\alpha\) values

First, consider the issue of an \(\alpha\) value.

  1. What, in your opinion, is a reasonable choice for the \(\alpha\) value (i.e., how much evidence would you need to be convinced there was a ‘real’ difference between the sample and the population)? Write down some reasons for your choice.

  2. Now, generate null and alternative hypotheses for the following variables, specifying the \(\alpha\) level and whether the test is one- or two-tailed. Each null hypothesis should use the overall population proportion or mean as its reference value.

ed
wage
bluecol
union
  3. Make a note of why you chose a one- or two-tailed hypothesis test.

  4. Next, make a small table with both the \(p\)-value and the confidence interval for each of these variables. Check the conditions for hypothesis testing.

  5. Interpret your \(p\)-values and confidence intervals with respect to your \(\alpha\) level. In the end, which of the variables do you believe is statistically significantly different from the population, based on your sample?

  6. Add two additional columns to your table with the ‘true’ value of the variables for just the south region and the ‘true’ value of your variables for the overall dataset. Were your conclusions correct or not?
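
One possible way to assemble the table from the steps above is sketched here, using base R's t.test() for the means and prop.test() for the proportions. The data are simulated placeholders so the block runs by itself. The \(\alpha\) of 0.05, the two-sided alternatives, and the "yes"/"no" coding of bluecol and union are all assumptions, so substitute your own choices and the real data.

```r
# Simulated placeholder data, NOT the real wages file.
set.seed(2)
n_pop <- 2000
us.dol.wages <- data.frame(
  south   = sample(c("yes", "no"), n_pop, replace = TRUE),
  ed      = sample(8:17, n_pop, replace = TRUE),
  wage    = round(runif(n_pop, min = 300, max = 2000)),
  bluecol = sample(c("yes", "no"), n_pop, replace = TRUE),
  union   = sample(c("yes", "no"), n_pop, replace = TRUE)
)
south <- us.dol.wages[us.dol.wages$south == "yes", ]
south_sample <- south[sample(nrow(south), 100, replace = TRUE), ]

alpha <- 0.05  # assumed for illustration; use the level you chose

# Means: one-sample t-test against the overall mean.
test_mean <- function(var) {
  t.test(south_sample[[var]], mu = mean(us.dol.wages[[var]]),
         alternative = "two.sided", conf.level = 1 - alpha)
}
# Proportions: one-sample test against the overall share of "yes".
test_prop <- function(var) {
  prop.test(sum(south_sample[[var]] == "yes"), nrow(south_sample),
            p = mean(us.dol.wages[[var]] == "yes"),
            alternative = "two.sided", conf.level = 1 - alpha)
}
# "True" value of a variable within a data frame (mean or share of "yes").
true_value <- function(df, var) {
  if (is.numeric(df[[var]])) mean(df[[var]]) else mean(df[[var]] == "yes")
}

tests <- list(ed = test_mean("ed"), wage = test_mean("wage"),
              bluecol = test_prop("bluecol"), union = test_prop("union"))

results <- data.frame(
  variable   = names(tests),
  p_value    = sapply(tests, function(h) h$p.value),
  ci_lower   = sapply(tests, function(h) h$conf.int[1]),
  ci_upper   = sapply(tests, function(h) h$conf.int[2]),
  true_south = sapply(names(tests), function(v) true_value(south, v)),
  true_all   = sapply(names(tests), function(v) true_value(us.dol.wages, v))
)
results$reject_null <- results$p_value < alpha
results
```

For a one-tailed test, change alternative to "less" or "greater"; the reported interval then becomes one-sided as well.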

Errors & Effect Size

For the same variables as those listed above, put yourself in the shoes of a policymaker who is considering additional government programs to help people in the south region if it can be shown that they differ on some important variables.

  1. Add another column to your table. Assess whether you think a Type I or Type II error would be more serious for each of the variables. Provide a justification for why.

  2. Add another column to your table with the size of the difference. Assess whether the difference is substantively large or not. How do you know if that is a large difference? You may want to check Google, etc. to see what the normal range of variation for these variables is.

  3. With your partner, write a summary paragraph that reports your findings and explains how they should be interpreted, given that they are based on a 100-person sample of Southern workers.
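
For the size-of-difference column, a minimal sketch might look like the following. It reports the raw gap between the sample and the overall dataset, and for the numeric variables it also divides that gap by the overall standard deviation. That standardisation is one common convention (similar in spirit to Cohen's d), not something this worksheet prescribes. The data are simulated placeholders, and the "yes"/"no" coding of bluecol and union is an assumption.

```r
# Simulated placeholder data, NOT the real wages file.
set.seed(3)
n_pop <- 2000
us.dol.wages <- data.frame(
  south   = sample(c("yes", "no"), n_pop, replace = TRUE),
  ed      = sample(8:17, n_pop, replace = TRUE),
  wage    = round(runif(n_pop, min = 300, max = 2000)),
  bluecol = sample(c("yes", "no"), n_pop, replace = TRUE),
  union   = sample(c("yes", "no"), n_pop, replace = TRUE)
)
south <- us.dol.wages[us.dol.wages$south == "yes", ]
south_sample <- south[sample(nrow(south), 100, replace = TRUE), ]

# Means: raw gap, plus the gap expressed in overall-SD units.
mean_effect <- function(var) {
  gap <- mean(south_sample[[var]]) - mean(us.dol.wages[[var]])
  c(difference = gap, standardized = gap / sd(us.dol.wages[[var]]))
}
# Proportions: gap in percentage points (no standardised version here).
prop_effect <- function(var) {
  gap <- mean(south_sample[[var]] == "yes") - mean(us.dol.wages[[var]] == "yes")
  c(difference = 100 * gap, standardized = NA)
}

effects <- rbind(ed = mean_effect("ed"), wage = mean_effect("wage"),
                 bluecol = prop_effect("bluecol"), union = prop_effect("union"))
round(effects, 3)
```

A statistically significant difference can still be small on these scales, which is why the size column is worth reading alongside the \(p\)-values.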

Extra time

  1. Develop a regression model that predicts wage. Is the variable south an important predictor in your model? Why or why not?
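
A possible starting point is a linear model fitted with lm(). The choice of predictors below is only an example, and the data are simulated placeholders in which wage depends on ed alone, so the south coefficient here says nothing about the real data. The "yes"/"no" coding of bluecol and union is assumed.

```r
# Simulated placeholder data, NOT the real wages file.
set.seed(4)
n_pop <- 2000
us.dol.wages <- data.frame(
  south   = sample(c("yes", "no"), n_pop, replace = TRUE),
  ed      = sample(8:17, n_pop, replace = TRUE),
  bluecol = sample(c("yes", "no"), n_pop, replace = TRUE),
  union   = sample(c("yes", "no"), n_pop, replace = TRUE)
)
# Simulated wages depend on education only; south has no built-in effect.
us.dol.wages$wage <- 200 + 60 * us.dol.wages$ed + rnorm(n_pop, sd = 250)

# Fit the model; character columns are treated as factors with "no" as baseline.
fit <- lm(wage ~ ed + south + bluecol + union, data = us.dol.wages)

# Coefficient table: estimate, standard error, t value, and p-value per term.
coefs <- coef(summary(fit))
coefs

# The row for south: its estimate is the predicted wage gap for south == "yes",
# holding the other predictors fixed.
coefs["southyes", ]
```

Comparing the south row's estimate and \(p\)-value with and without the other predictors is one way to judge whether region matters once education and job type are accounted for.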